
feat: add FireworksTrainingRolloutProcessor for RFT (FIR2-1351) #445

Closed
benjibc wants to merge 1 commit into main from cursor/managed-rl-training-rollout-processor-135f

Conversation


benjibc (Contributor) commented Apr 22, 2026

Description

Adds a new default RolloutProcessor subclass that drives Fireworks /v1/completions via FireworksV1CompletionsClient and surfaces the per-sample prompt token ids, completion token ids, and inference logprobs required by reinforcement fine-tuning (RFT) training loops (GRPO, CISPO, DAPO, GSPO).

Problem

SingleTurnRolloutProcessor uses LiteLLM chat completions and discards token-level data, so scored EvaluationRows are fine for evaluation but cannot feed a training loop. Today, teams that need training-ready rollouts write a bespoke RolloutProcessor — the FrozenLake example in fw-ai/cookbook is ~800 lines. This puts training-compatible rollouts out of reach of every customer evaluator bundle unless they reimplement the rollout path themselves.

Fireworks' managed RFT flow needs this for every customer job, so promoting the pattern into an Eval Protocol default removes the per-customer 800-line tax.

What it does

For each EvaluationRow, FireworksTrainingRolloutProcessor:

  • Reads model / temperature / max_tokens / n from completion_params.
  • Builds prompt token ids locally via FireworksV1CompletionsClient.build_prompt_token_ids(...).
  • Fires n parallel /v1/completions calls from the same prompt_token_ids, so each completion gets independent retry behaviour rather than collapsing on partial server failures.
  • Appends the first completion as the assistant message so existing evaluators that inspect last_assistant_message() keep scoring without modification.
  • Populates EvaluationRow.execution_metadata.extra with the per-completion payload (a sketch follows this list):
    • prompt_ids: list[int] (shared across completions)
    • completion_ids: list[list[int]] (one per completion)
    • inference_logprobs: list[list[float]] (aligned to completion tokens)
    • completions_text: list[str]
    • truncated: list[bool] (True when finish_reason == 'length')
    • finish_reasons: list[str]
  • Merges into pre-existing extra rather than clobbering it (coexists with OpenEnvRolloutProcessor, tracing_utils, etc.).
  • Caches one FireworksV1CompletionsClient per model id; closes them all via acleanup().
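
To make the payload concrete, here is a minimal sketch of what execution_metadata.extra could look like for n=2 completions of a three-token prompt; the key names follow the list above, while the token ids and logprob values are invented for illustration:

```python
# Illustrative extra payload for n=2 completions; all values are made up.
extra = {
    "prompt_ids": [128000, 9906, 1917],                 # list[int], shared across completions
    "completion_ids": [[791, 4320], [2028, 374, 264]],  # list[list[int]], one per completion
    "inference_logprobs": [[-0.12, -1.05], [-0.33, -0.80, -2.10]],  # aligned to completion tokens
    "completions_text": ["The answer", "This is a"],
    "truncated": [False, True],                         # second completion hit max_tokens
    "finish_reasons": ["stop", "length"],
}
```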

Shape rationale

OpenEnvRolloutProcessor already writes flat prompt_ids / completion_ids concatenated across turns (multi-turn, per-episode agent rollouts). Single-turn RFT samples n>1 completions per prompt for advantage estimation and needs per-completion indexing, hence the list[list[...]] shape here. A training adapter on the consumer side can key into either convention without loss of generality.
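
As a sketch of that consumer-side adapter, assuming the flat convention stores completion_ids as a single list[int] (the helper name here is illustrative, not an eval-protocol API):

```python
# Hypothetical adapter: normalize either convention to per-completion lists.
def per_completion_ids(extra: dict) -> list[list[int]]:
    ids = extra["completion_ids"]
    if ids and isinstance(ids[0], int):
        # Flat OpenEnvRolloutProcessor convention: one concatenated episode.
        return [ids]
    # Nested FireworksTrainingRolloutProcessor convention: already per-completion.
    return ids
```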

Architecture

```mermaid
flowchart LR
    Row[EvaluationRow<br/>messages, tools] -->|messages → dicts| P[FireworksTrainingRolloutProcessor]
    P -->|build_prompt_token_ids| Client[FireworksV1CompletionsClient]
    Client -->|/v1/completions × n| API[Fireworks API]
    API -->|n completions<br/>with prompt_ids,<br/>completion_ids, logprobs| P
    P -->|first completion<br/>→ assistant message| OutMsgs[row.messages]
    P -->|execution_metadata.extra| Extra[prompt_ids, completion_ids,<br/>inference_logprobs, completions_text,<br/>truncated, finish_reasons]
```

Type of Change

  • New feature

Testing

8 new unit tests in tests/pytest/test_fireworks_training_rollout_processor.py, using a stub FireworksV1CompletionsClient so no network calls or HF tokenizers are required (one representative case is sketched after the output below):

  • Per-completion extra payload has n-length lists with correct shapes and values (n=2 case).
  • n=1 still produces the list-of-lists shape with length 1 (not a scalar).
  • Trailing assistant messages are dropped by default before sampling.
  • Trailing assistant messages are preserved when the flag is disabled.
  • Missing model in completion_params raises ValueError.
  • n < 1 raises ValueError.
  • acleanup() closes every cached client.
  • Pre-existing keys in execution_metadata.extra are preserved across rollout.

All existing SingleTurnRolloutProcessor tests still pass.

```
$ python -m pytest tests/pytest/test_fireworks_training_rollout_processor.py tests/pytest/test_single_turn_rollout_processor.py
11 passed in 3.79s
```
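
For flavor, a sketch of the n=1 shape test in the spirit of the list above; the fixtures (run_rollout, make_row) are hypothetical stand-ins for the project's own test helpers, not the actual file contents:

```python
# Hypothetical shape test: n=1 must stay list-of-lists, never flatten.
def test_n1_keeps_list_of_lists_shape(run_rollout, make_row):
    row = run_rollout(make_row(completion_params={"model": "m", "n": 1}))
    extra = row.execution_metadata.extra
    assert len(extra["completion_ids"]) == 1
    assert isinstance(extra["completion_ids"][0], list)  # not a flat token list
    assert len(extra["inference_logprobs"]) == 1
    assert len(extra["truncated"]) == len(extra["finish_reasons"]) == 1
```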

Surface

  • New public default processor exposed via eval_protocol.pytest.FireworksTrainingRolloutProcessor (usage sketch below).
  • No breaking changes to existing processors, RolloutProcessor base class, or EvaluationRow schema — all new data is carried through the existing execution_metadata.extra bag.
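
Hypothetical wiring, assuming the evaluation_test entry point that eval_protocol.pytest exposes for other processors; the model id and decorator arguments are illustrative:

```python
from eval_protocol.pytest import FireworksTrainingRolloutProcessor, evaluation_test

@evaluation_test(
    completion_params=[{"model": "my-model", "n": 4, "temperature": 1.0}],
    rollout_processor=FireworksTrainingRolloutProcessor(),
)
def test_rft_rollout(row):
    # Evaluator logic is unchanged; the training payload rides along in
    # row.execution_metadata.extra for a downstream training adapter.
    ...
```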

Follow-ups (not in this PR)

  • Fireworks managed RFT control plane (separate repo) will auto-select this processor for eval-v3 evaluator bundles launched from RFT jobs — tracked as FIR2-1352.
  • Fireworks control plane dataset-transform step will consume the execution_metadata.extra shape introduced here — tracked as FIR2-1353.
  • End-to-end parity test (legacy RFT path vs. this processor) on GSM8K — tracked as FIR2-1366.

A new default RolloutProcessor that drives Fireworks /v1/completions
via `FireworksV1CompletionsClient` and surfaces the per-sample
token-level data required by reinforcement fine-tuning training
(GRPO, CISPO, DAPO, GSPO).

Problem
-------
The existing `SingleTurnRolloutProcessor` uses LiteLLM chat
completions and discards token ids + inference logprobs, so scored
`EvaluationRow`s are fine for evaluation but cannot feed a training
loop. Today, teams that need training-ready rollouts write a bespoke
`RolloutProcessor` (the FrozenLake example in fw-ai/cookbook is
~800 lines). This puts token ids / logprobs out of reach of every
customer evaluator bundle unless they rewrite their own processor.

What it does
------------
For each `EvaluationRow`, `FireworksTrainingRolloutProcessor`:

* Reads model / temperature / max_tokens / n from `completion_params`.
* Builds prompt token ids locally via `FireworksV1CompletionsClient`.
* Fires `n` parallel `/v1/completions` calls from the same
  `prompt_token_ids`, so each completion gets independent retry
  behaviour rather than collapsing on partial server failures.
* Appends the first completion as the assistant message so existing
  evaluators that inspect `last_assistant_message()` keep scoring.
* Populates `EvaluationRow.execution_metadata.extra` with:
  - `prompt_ids: list[int]` (shared across completions)
  - `completion_ids: list[list[int]]` (per-completion)
  - `inference_logprobs: list[list[float]]` (aligned to completion tokens)
  - `completions_text: list[str]`
  - `truncated: list[bool]` (`finish_reason == 'length'`)
  - `finish_reasons: list[str]`
* Merges into pre-existing `extra` rather than clobbering it.
* Caches one client per model id; closes them all via `acleanup()`.

Shape rationale
---------------
OpenEnvRolloutProcessor already writes flat `prompt_ids` /
`completion_ids` concatenated across turns (multi-turn, per-episode
agent rollouts). Single-turn RFT samples n>1 completions per prompt
for advantage estimation and needs per-completion indexing, hence the
`list[list[...]]` shape here. The training adapter on the
consumer side can key into either convention without loss of
generality.

Tests
-----
8 new unit tests stub `FireworksV1CompletionsClient` so no network
calls or tokenizers are needed; existing
`SingleTurnRolloutProcessor` suite still passes.

Fixes FIR2-1351

benjibc commented Apr 22, 2026

Closing: not needed.

Upstream review revealed that the Fireworks managed RFT cutover does not require a training-aware RolloutProcessor. The cleaner pattern (demonstrated by fw-ai/fireworks#21366 — scripts/rollr_cispo/train_cispo_cookbook.py) is to use a plain RemoteRolloutProcessor for the rollout/eval boundary and reconstruct the training-relevant data trainer-side via:

  • local tokenization using FireworksV1CompletionsClient.build_prompt_token_ids / build_assistant_turn_token_ids
  • a prefill-logprobs pass (echo=true) against the same inference deployment the trainer uses

That pattern works for CISPO (and by extension GRPO/DAPO/GSPO) today, and it keeps eval-protocol's rollout contract single-purpose. No reason to push a training-specific default into EP.
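
For context, a minimal sketch of that trainer-side prefill-logprobs pass, assuming an OpenAI-style /v1/completions endpoint that honors echo with max_tokens=0 and accepts a token-id prompt; the endpoint URL, model id, and token lists below are illustrative:

```python
import os
import requests

# Token ids from the local tokenization step (illustrative values).
prompt_ids = [128000, 9906, 1917]   # via build_prompt_token_ids
completion_ids = [791, 4320]        # via build_assistant_turn_token_ids

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
    json={
        "model": "my-deployment",               # illustrative
        "prompt": prompt_ids + completion_ids,  # replay the sampled tokens
        "max_tokens": 0,                        # generate nothing new
        "echo": True,                           # echo prompt with logprobs
        "logprobs": 1,
    },
)
token_logprobs = resp.json()["choices"][0]["logprobs"]["token_logprobs"]
```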

No harm done — the module is self-contained and not wired into anything else.

benjibc closed this Apr 22, 2026